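Lane detection is one of the fundamental modules in autonomous driving. In this paper, we employ a transformer-only method for lane detection, so it can benefit from the rapid development of fully vision transformers by fine-tuning weights that were fully pre-trained on large datasets. More importantly, this paper proposes a novel and general framework called PriorLane, which enhances the segmentation performance of the fully vision transformer by introducing low-cost local prior knowledge. PriorLane uses an encoder-only transformer to fuse the features extracted by a pre-trained segmentation model with prior knowledge embeddings. Note that a Knowledge Embedding Alignment (KEA) module improves the fusion performance by aligning the knowledge embeddings. Extensive experiments on our ZJLAB dataset show that PriorLane outperforms SOTA lane detection methods by 2.82% mIoU. The code will be released at: https://github.com/vincentqqb/priorlane.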
Translated by Google Translate
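The 2.82% figure above is reported in mIoU (mean intersection over union). As a minimal sketch of how this metric is commonly computed for segmentation output, assuming flattened lists of integer class labels, the hypothetical helper `mean_iou` below averages the per-class IoU and skips classes that appear in neither the prediction nor the ground truth. It is an illustration only, not the PriorLane evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length lists of integer labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both lists (intersection) or in either list (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union == 0:
            continue  # class absent from both; leave it out of the average
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: class 0 has IoU 1/2, class 1 has IoU 2/3.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
```

Skipping absent classes is one common convention; some benchmarks instead count them as zero, so the choice should follow the benchmark being reproduced.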
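Joint super-resolution and inverse tone-mapping (SR-ITM) aims to improve the visual quality of videos that are deficient in both resolution and dynamic range. The problem arises when a 4K high dynamic range (HDR) TV is used to watch a low-resolution standard dynamic range (LR SDR) video. Previous methods that rely on learning local information typically do not preserve color conformity and long-range structural similarity well, which leads to unnatural color transitions and texture artifacts. To tackle these challenges, we propose a global priors guided modulation network (GPGMNet) for joint SR-ITM. In particular, we design a global priors extraction module (GPEM) to extract a color conformity prior and a structural similarity prior, which benefit the ITM and SR tasks, respectively. To further exploit the global priors while preserving spatial information, we devise multiple global priors guided spatial-wise modulation blocks (GSMBs) that use only a few parameters for intermediate feature modulation; their modulation parameters are generated from the shared global priors and the spatial feature maps produced by a spatial pyramid convolution block (SPCB). With these designs, GPGMNet achieves higher visual quality at lower computational complexity. Extensive experiments show that the proposed GPGMNet outperforms state-of-the-art methods. Specifically, our model exceeds the state of the art by 0.64 dB in PSNR, with 69% fewer parameters and a 3.1× speedup. The code will be released soon.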
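To put the reported 0.64 dB PSNR gain in context, here is a minimal sketch of the standard PSNR formula, 10·log10(MAX² / MSE), over flattened pixel lists. The helper names and the default peak value of 255 are assumptions for illustration; HDR evaluations normally use a different peak and signal representation, and this is not the authors' evaluation code.

```python
import math


def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak value."""
    return 10.0 ** (-gain_db / 10.0)


# A 0.64 dB gain corresponds to an MSE ratio of about 0.863,
# i.e. roughly 14% lower mean squared error.
ratio = mse_ratio_for_gain(0.64)
```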
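In this paper, we introduce VCSL (Video Copy Segment Localization), a new comprehensive video copy dataset with segment-level annotations. Compared with existing copy detection datasets, which are limited either to video-level annotations or to small scale, VCSL not only has two orders of magnitude more segment-level labelled data, with 160k realistic video copy pairs containing more than 280k localized copied segment pairs, but also covers a variety of video categories and a wide range of video durations. All copied segments in each collected video pair are manually extracted and accompanied by precisely annotated start and end timestamps. Alongside the dataset, we propose a novel evaluation protocol that better measures the prediction accuracy of overlapping copied segments between a video pair and adapts better to different scenarios. By benchmarking several baseline and state-of-the-art segment-level video copy detection methods with the proposed dataset and evaluation metric, we provide a comprehensive analysis that reveals the strengths and weaknesses of current approaches. The VCSL dataset, metric, and benchmark code are all publicly available at https://github.com/alipay/vcsl.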
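On most video platforms, such as YouTube and TikTok, the videos being played have usually gone through multiple rounds of encoding, such as encoding by recording devices, software encoding by video editing apps, and single or multiple transcoding passes by video application servers. Previous work on compressed video restoration typically assumes that the compression artifacts are caused by a single encoding pass, so the resulting solutions often do not work well in practice. In this paper, we propose a new method, the temporal spatial auxiliary network (TSAN), for transcoded video restoration. Our method takes into account the distinct characteristics of video encoding and transcoding, and we treat the initial shallow-encoded videos as intermediate labels that help the network perform self-supervised attention training. In addition, we exploit information from adjacent frames and propose temporal deformable alignment and pyramidal spatial fusion for transcoded video restoration. Experimental results show that the proposed method outperforms previous techniques. The code is available at https://github.com/iceCherylxuli/tsan.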
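In this paper, we propose a robust sample generation scheme for constructing informative triplets. The proposed hard sample generation is a two-stage synthesis framework that produces hard samples through effective positive and negative sample generators, one in each stage. The first stage stretches the anchor-positive pairs with a piecewise linear manipulation and improves the quality of the generated samples through a carefully designed conditional generative adversarial network, which lowers the risk of mode collapse. The second stage uses an adaptive reverse metric constraint to generate the final hard samples. Extensive experiments on several benchmark datasets verify that our method outperforms existing hard sample generation algorithms. We also find that combining the proposed hard sample generation method with existing triplet mining strategies further improves deep metric learning performance.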
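For readers unfamiliar with triplets, the sketch below shows the standard triplet margin loss that such samples are usually fed into, and why a "hard" negative (one close to the anchor) is more informative than an easy one. The function names and the margin value are assumptions for illustration; this is the generic loss, not the paper's two-stage generator.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor, positive = [0.0, 0.0], [1.0, 0.0]

# Easy negative: far from the anchor, so the margin is already satisfied
# and the triplet contributes no training signal.
easy = triplet_loss(anchor, positive, [3.0, 0.0])

# Hard negative: close to the anchor, so the loss is positive
# and the triplet still drives learning.
hard = triplet_loss(anchor, positive, [1.5, 0.0])
```

Here `easy` is 0.0 and `hard` is 0.5, which is the motivation for mining or generating hard samples.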
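This paper describes the Microsoft end-to-end neural text-to-speech (TTS) system, DelightfulTTS, for Blizzard Challenge 2021. The goal of the challenge is to synthesize natural, high-quality speech from text, and we approach this goal from two perspectives. The first is to directly model and generate waveforms at a 48 kHz sampling rate, which brings higher perceptual quality than previous systems with 16 kHz or 24 kHz sampling rates. The second is to model the variation information in speech through systematic design, which improves prosody and naturalness. Specifically, for 48 kHz modeling, we predict a 16 kHz mel-spectrogram in the acoustic model and propose a vocoder called HiFiNet that generates the 48 kHz waveform directly from the predicted 16 kHz mel-spectrogram, which better balances training efficiency, modeling stability, and voice quality. We systematically model variation information from both explicit (speaker ID, language ID, pitch, and duration) and implicit (utterance-level and phoneme-level prosody) perspectives: 1) for speaker and language ID, we use lookup embeddings in training and inference; 2) for pitch and duration, we extract the values from paired text-speech data in training and use two predictors to predict the values in inference; 3) for utterance-level and phoneme-level prosody, we use two reference encoders to extract the values in training and two separate predictors to predict the values in inference. In addition, we introduce an improved Conformer block to better model local and global dependencies in the acoustic model. For task SH1, DelightfulTTS achieves a mean score of 4.17 in the MOS test and 4.35 in the SMOS test, which indicates the effectiveness of the proposed system.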
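Artificial intelligence (AI) has recently demonstrated its capabilities in almost every field of life. Machine learning, a subset of AI, is a "hot" topic for researchers and a crucial part of modern research. It outperforms other classical forecasting techniques in almost all natural applications. However, modern machine learning algorithms are hungry for big data, so researchers may prefer not to use them when only small datasets are available. To address this issue, the main purpose of this survey is to illustrate and review related studies that show the significance of a semi-parametric machine learning framework called Grey Machine Learning (GML). This framework can handle both large and small datasets when forecasting likely outcomes of time series. The survey gives an overview of existing semi-parametric machine learning techniques for time series forecasting and provides researchers with a primer on the GML framework. To give readers an in-depth understanding, it includes a brief description of machine learning and discusses various forms of conventional grey forecasting models. It also briefly explains the importance of the GML framework.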
In this paper, we study the problem of image-text matching. Inferring the latent semantic alignment between objects or other salient stuff (e.g. snow, sky, lawn) and the corresponding words in sentences makes it possible to capture the fine-grained interplay between vision and language, and makes image-text matching more interpretable. Prior work either simply aggregates the similarity of all possible pairs of regions and words without attending differentially to more and less important words or regions, or uses a multi-step attentional process to capture a limited number of semantic alignments, which is less interpretable. In this paper, we present Stacked Cross Attention to discover the full latent alignments, using both image regions and words in a sentence as context, and to infer image-text similarity. Our approach achieves state-of-the-art results on the MS-COCO and Flickr30K datasets. On Flickr30K, our approach outperforms the current best methods by a relative 22.1% in text retrieval from an image query and by a relative 18.2% in image retrieval with a text query (based on Recall@1). On MS-COCO, our approach improves sentence retrieval by a relative 17.8% and image retrieval by a relative 16.6% (based on Recall@1 using the 5K test set). Code has been made available at: https://github.com/kuanghuei/SCAN.
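The idea of attending to words with each image region as context can be sketched as follows. This is a deliberately simplified, single-direction illustration: the function names, the softmax temperature, and the average pooling over regions are assumptions, and the released SCAN code should be consulted for the actual formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def softmax(xs, temperature):
    """Numerically stable softmax of temperature-scaled scores."""
    m = max(xs)
    exps = [math.exp(temperature * (x - m)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def image_text_similarity(regions, words, temperature=4.0):
    """Average, over regions, of the similarity to the attended sentence vector."""
    scores = []
    for region in regions:
        # Attention weights over words, using this region as the query.
        weights = softmax([cosine(region, w) for w in words], temperature)
        # Attended sentence vector: attention-weighted sum of word vectors.
        attended = [sum(wt * w[k] for wt, w in zip(weights, words))
                    for k in range(len(region))]
        scores.append(cosine(region, attended))
    return sum(scores) / len(scores)


regions = [[1.0, 0.0], [0.0, 1.0]]
matched = image_text_similarity(regions, [[1.0, 0.0], [0.0, 1.0]])
mismatched = image_text_similarity(regions, [[-1.0, 0.0], [0.0, -1.0]])
```

With these toy vectors, the matched sentence scores higher than the mismatched one, because each region finds a word it aligns with.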
Designing experiments often requires balancing learning about the true treatment effects against earning from allocating more samples to the superior treatment. While optimal algorithms for the Multi-Armed Bandit Problem (MABP) provide allocation policies that optimally balance learning and earning, they tend to be computationally expensive. The Gittins Index (GI) is a solution to the MABP that can attain both optimality and computational efficiency, and it has recently been used in experiments with Bernoulli and Gaussian rewards. For the first time, we present a modification of the GI rule that can be used in experiments with exponentially distributed rewards. We report its performance in simulated 2-armed and 3-armed experiments. Compared to traditional non-adaptive designs, our novel GI-modified design shows operating characteristics that are comparable in learning (e.g. statistical power) but substantially better in earning (e.g. direct benefits). This illustrates the potential of designs that use a GI approach for allocating participants to improve participant benefits, increase efficiency, and reduce experimental costs in adaptive multi-armed experiments with exponential rewards.
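The learning-versus-earning trade-off can be illustrated with a small simulation of a multi-armed experiment with exponentially distributed rewards. Note that the adaptive rule below is a plain greedy rule (pick the arm with the highest observed mean), used only as a simple stand-in; it is not the Gittins Index or the modified GI rule from the paper, and the function name, arm means, and horizon are assumptions.

```python
import random


def run_trial(means, horizon, adaptive, rng):
    """Allocate `horizon` participants to arms with exponential rewards.

    Returns (per-arm allocation counts, total reward earned).
    """
    k = len(means)
    counts = [0] * k
    totals = [0.0] * k
    for t in range(horizon):
        if not adaptive:
            arm = t % k  # non-adaptive design: fixed equal allocation
        elif t < k:
            arm = t  # adaptive design: first sample every arm once
        else:
            # Greedy stand-in (NOT the Gittins Index): highest observed mean.
            arm = max(range(k), key=lambda a: totals[a] / counts[a])
        # expovariate takes the rate, i.e. 1 / mean.
        reward = rng.expovariate(1.0 / means[arm])
        counts[arm] += 1
        totals[arm] += reward
    return counts, sum(totals)


rng = random.Random(0)
equal_counts, equal_reward = run_trial([1.0, 2.0], 100, adaptive=False, rng=rng)
greedy_counts, greedy_reward = run_trial([1.0, 2.0], 100, adaptive=True, rng=rng)
```

The equal-allocation design maximizes what is learned about both arms, while an adaptive rule shifts participants toward the arm that looks better. An index policy such as GI is designed to make that shift in a principled way, whereas pure greedy can lock onto an inferior arm after an unlucky start.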
Transformers have achieved impressive success on various computer vision tasks. However, most existing studies require pretraining the Transformer backbone on a large-scale labeled dataset (e.g., ImageNet) to achieve satisfactory performance, and such datasets are usually unavailable for medical images. Additionally, due to the gap between medical and natural images, the improvement from ImageNet-pretrained weights degrades significantly when the weights are transferred to medical image processing tasks. In this paper, we propose Bootstrap Own Latent of Transformer (BOLT), a self-supervised learning approach designed specifically for medical image classification with a Transformer backbone. BOLT consists of two networks, the online and target branches, for self-supervised representation learning. Concretely, the online network is trained to predict the target network's representation of the same patch embedding tokens under a different perturbation. To make the most of the Transformer with limited medical data, we propose an auxiliary difficulty ranking task, in which the Transformer must identify which branch (i.e., online or target) is processing the more difficult perturbed tokens. Overall, the Transformer learns to distill transformation-invariant features from the perturbed tokens so that it can both measure difficulty and maintain the consistency of the self-supervised representations. The proposed BOLT is evaluated on three medical image processing tasks: skin lesion classification, knee fatigue fracture grading, and diabetic retinopathy grading. The experimental results validate the superiority of BOLT for medical image classification over ImageNet-pretrained weights and state-of-the-art self-supervised learning approaches.